7 - Clustering

Author

CDN team

Published

Last compiled on 20 May, 2025

Introduction

Clustering is the process of grouping cells based on their gene expression profiles and takes place after dimensionality reduction. For scRNA-seq data we use community detection algorithms: we compute a K-nearest-neighbor graph and then group cells based on their shared nearest neighbors. To obtain fine-grained annotations effectively, clustering should be done iteratively so that as little information as possible is lost or misidentified.

In this notebook, we learn:

  • Clustering in Seurat requires the computation of a K-nearest-neighbor graph followed by the identification of clusters in this graph using the functions FindNeighbors() and FindClusters().

  • The k.param parameter in FindNeighbors() determines the connectivity of the graph. A higher k produces a more highly connected graph, which results in fewer clusters.

  • The resolution parameter in FindClusters() controls the granularity of the clustering. A higher resolution results in more clusters.

  • Clustering results can be evaluated quantitatively by looking at the sample and batch distribution across clusters, using the Simpson diversity index, and a silhouette analysis of the clusters.

Review

We left off last session having filtered out the poor-quality data without regard for its distribution. Last week, we learned how principal components are calculated, what a latent space is, and defined a kNN (k-nearest neighbors) graph.

To review those:

We take a cell x gene matrix, normalize it, and reduce its dimensions using PCA. We looked at the elbow plot to determine the optimal number of PCs to capture the variance. After selecting the top 30 PCs, we generated a k-nearest neighbors (kNN) graph to represent the relationships between cells based on their gene expression profiles, and ran FindClusters() to identify clusters of cells based on their neighborhood relationships. Kyle introduced Harmony as an integration technique to correct for batch effects, and I went over QC metrics, including doublet detection, that help us understand the quality and content of our data.

Glossary

PCA - A dimensionality reduction technique that reduces the number of dimensions in a dataset while retaining as much variance as possible. The first principal component captures the axis with the most variance in the data; each subsequent component captures the next independent (orthogonal) axis that accounts for the most remaining variance.

Latent Space - The latent space is the low-dimensional representation of the data that captures the most important features.

kNN Graph - A graph that represents the relationships between cells based on their gene expression profiles. Each cell is connected to its k (1, 20, 100) nearest neighbors in the graph.

SNN Graph - A graph that represents how many neighbors are shared between 2 cells. This makes the graph more robust to outliers.
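The SNN weighting can be sketched in a few lines of base R. This is a toy illustration, not Seurat's implementation, and the cell names are made up: the edge weight between two cells is the Jaccard index of their k-nearest-neighbor sets.

```r
# Toy SNN edge weighting: the weight between two cells is the Jaccard
# index of their k-nearest-neighbor sets (shared / total neighbors)
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical neighbor sets for three cells (k = 4)
nn1 <- c("c2", "c3", "c4", "c5")
nn2 <- c("c1", "c3", "c4", "c9")
nn3 <- c("c7", "c8", "c9", "c10")

jaccard(nn1, nn2)  # 2 shared of 6 total -> 1/3, a strong SNN edge
jaccard(nn1, nn3)  # no shared neighbors -> 0, edge gets pruned
```

Cells whose neighborhoods barely overlap end up with a weight near 0, which is exactly why outliers lose their connections in the SNN graph.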

Resources

  1. Challenges in unsupervised clustering of single-cell RNA-seq data
  2. Current best practices in single-cell RNA-seq analysis: a tutorial
  3. Bioconductor
  4. Single Cell Best Practices

Load Libraries and Data

if (!requireNamespace("dplyr", quietly = TRUE))
    install.packages('dplyr')
if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages('tidyverse')
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages('Seurat')
if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages('colorBlindness')
if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages('RColorBrewer')
if (!requireNamespace("cluster", quietly = TRUE))
    install.packages('cluster')
if (!requireNamespace("viridis", quietly = TRUE))
    install.packages('viridis')
if (!requireNamespace("ggplot2", quietly = TRUE))
    install.packages('ggplot2')
if (!requireNamespace("ggalluvial", quietly = TRUE))
    install.packages("ggalluvial") 
if (!requireNamespace("tidygraph", quietly = TRUE))
    install.packages("tidygraph") 
if (!requireNamespace("ggraph", quietly = TRUE))
    install.packages("ggraph") 
if (!requireNamespace("vegan", quietly = TRUE))
    install.packages("vegan") # used below via vegan::diversity()
library(dplyr)
library(Seurat)
library(tidyverse)
library(RColorBrewer)
library(colorBlindness)
library(DoubletFinder)
library(cluster)
library(viridis)
library(ggplot2)
library(ggalluvial)
library(tidygraph)
library(ggraph)

set.seed(687)

Load Data

# Load the Seurat object with doublet and batch information
se <- readRDS('../data/se_qc.rds')
se 
An object of class Seurat 
33694 features across 22502 samples within 1 assay 
Active assay: RNA (33694 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 2 dimensional reductions calculated: pca, umap
# Set color palette for donors
donor_pal <- c(
    "#66C2A4", "#41AE76", "#238B45", "#006D2C",
    "#41B6C4", "#1D91C0", "#225EA8", "#253494",
    "#FD8D3C", "#FC4E2A", "#E31A1C", "#BD0026",
    "#ad393b", "#800000", "#800050")

names(donor_pal) <- c(
    "T024", "T036", "T44", "T057", "T110", "T160", "T161", "T182",
    "T017", "T019", "T176", "T189", "T197", "T203", "T202"
)

Set Up

Construct the kNN graph with FindNeighbors():

FindNeighbors() is the Seurat function that calculates the k-nearest neighbors of each cell in PCA space. The number of neighbors used for this calculation is controlled by the k.param parameter, which defaults to 20. It computes pairwise distances between cells using a metric such as Euclidean distance, Manhattan distance, Pearson correlation distance, or cosine similarity.

From BioStars:

FindNeighbors() is a function that is used to find the nearest neighbors of your single cell data point within a dataset. It works by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors. It’s often employed in various applications such as anomaly detection and dimensionality reduction. The concept is that given a data point, you want to identify the closest data points to it based on some similarity metric, such as Euclidean distance or cosine similarity. This helps to identify similar points in the dataset, which can be useful for making predictions or understanding the distribution of the data.

The default values of FindNeighbors() are listed under ?FindNeighbors. Let’s modify these to fit our analysis:

se <- se %>%
    FindNeighbors( 
        reduction = "pca",
        dims = 1:30,
        k.param = 20,
        verbose = FALSE
    )
# Look at the k-nearest neighbors (nn) and shared nearest neighbors (snn) graphs computed
se@graphs
$RNA_nn
A Graph object containing 22502 cells
$RNA_snn
A Graph object containing 22502 cells

FindClusters

FindClusters() identifies clusters of cells based on their neighborhood relationships. It takes as input the graph made by FindNeighbors() and outputs cluster assignments for each cell, found at se@meta.data$seurat_clusters.

The resolution parameter controls the granularity of the clustering. Higher values of resolution will result in more clusters, while lower values will result in fewer clusters. The default value is 0.8.

From BioStars:

FindClusters() is a function used for clustering data points into groups or clusters based on their similarity. It uses a graph-based clustering approach and a Louvain algorithm. Clustering is an unsupervised learning technique where the algorithm groups similar cells together without any predefined labels. The goal is to find patterns and structure in your data. The number of clusters and the algorithm used can vary based on the problem and data characteristics. Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.


The algorithm parameter selects the community detection method:

1 - original Louvain algorithm

2 - Louvain algorithm with multilevel refinement

3 - SLM (Smart Local Moving) algorithm

4 - Leiden algorithm

See ?FindClusters for more information.
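As a sketch, switching from the default Louvain to Leiden only requires changing the algorithm argument. Note that in Seurat the Leiden option is dispatched through reticulate and needs the Python leidenalg package available; the resolution value here is just an example.

```r
# Sketch: Leiden clustering instead of the default Louvain.
# Requires the Python 'leidenalg' package, e.g.:
#   reticulate::py_install("leidenalg")
se <- FindClusters(
    object = se,
    resolution = 0.1,
    algorithm = 4,      # 4 = Leiden
    method = "igraph"   # sparse representation, recommended for large datasets
)
```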

Clustering

Clustering is considered a classical unsupervised machine learning task. Seurat identifies cell clusters with community detection algorithms: Louvain by default, with Leiden available as an option. A community detection algorithm identifies groups of nodes in a network that are densely connected. Both algorithms find tightly connected communities in the shared nearest neighbor graph computed in FindNeighbors(). Shared nearest neighbor graphs are more robust than k-nearest-neighbor graphs because they weight edges by the number of neighbors two cells share. Leiden is a 2019 improvement on the Louvain algorithm, so it is common to see both in scRNA-seq analysis.

In clustering, the goal is not to see how many clusters you can pull apart; it is an iterative process. Especially in the first pass, you want to separate the main cell groups, such as epithelial cells and immune cells, so you can refine each of them into more granular cell types in the next iteration.

After clustering, we’ll review some cluster validation techniques to qualitatively and quantitatively check the quality of the clustering results.

se <- FindClusters(
      object = se,
      resolution = c(0.01, 0.05, 0.1, 0.15, 0.2, 0.25),
      algorithm = 1) 
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 831450

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9971
Number of communities: 8
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 831450

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9901
Number of communities: 12
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 831450

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9843
Number of communities: 16
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 831450

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9795
Number of communities: 18
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 831450

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9750
Number of communities: 22
Elapsed time: 1 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 22502
Number of edges: 831450

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9708
Number of communities: 23
Elapsed time: 2 seconds
# Results seen at RNA_snn_res in metadata
colnames(se@meta.data)[str_detect(colnames(se@meta.data), "RNA_snn_res")]
[1] "RNA_snn_res.0.01" "RNA_snn_res.0.05" "RNA_snn_res.0.1"  "RNA_snn_res.0.15" "RNA_snn_res.0.2"  "RNA_snn_res.0.25"
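Once we settle on a resolution, we can make it the active identity so that downstream Seurat functions use it by default. A sketch, assuming we pick 0.05:

```r
# Set the chosen resolution as the active identity; DimPlot(), FindMarkers(),
# etc. operate on the active identity unless told otherwise
Idents(se) <- "RNA_snn_res.0.05"
levels(Idents(se))  # cluster labels at this resolution
```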

KNN vs SNN

Subset Seurat object to visualize KNN and SNN

se_sub <- se[, sample(colnames(se), 1000)] # Subset to 1000 cells

se_sub <- se_sub %>%
    FindNeighbors(
        reduction = "pca",
        dims = 1:30,
        k.param = 10,
        verbose = FALSE,
        return.neighbor = FALSE,
        compute.SNN = TRUE
    )

Now let’s do some processing to visualize the KNN and SNN graphs

# Extract PCA
pca_coords <- se_sub@reductions$pca@cell.embeddings[, 1:3] %>%
    as.data.frame() %>%
    rownames_to_column("name")

## Edges for KNN graph
edges_knn <- data.frame(se_sub@graphs$RNA_nn, check.names = FALSE) %>%
    tibble::rownames_to_column("source") %>%
    pivot_longer(cols = -source, names_to = "target") %>%
    dplyr::filter(source != target & value > 0) %>%
    dplyr::select(-value)

## Edges for SNN graph
edges_snn <- data.frame(se_sub@graphs$RNA_snn, check.names = FALSE) %>%
    tibble::rownames_to_column("source") %>%
    pivot_longer(cols = -source, names_to = "target", values_to = "jaccard") %>%
    # Prune SNN: keep edges with a Jaccard index > 1/7 (~0.14). This is a strict
    # threshold for visualization purposes, to highlight the robustness to outliers
    dplyr::filter(source != target & jaccard > 1 / 7) %>%
    dplyr::select(-jaccard)

# Create graph objects
graph_knn <- tbl_graph(
        edges = edges_knn,
        nodes = pca_coords,
        directed = FALSE) %>%
    activate(nodes)

graph_snn <- tbl_graph(
        edges = edges_snn,
        nodes = pca_coords,
        directed = FALSE) %>%
    activate(nodes)

Visualize KNN and SNN graph

# Visualize KNN graph
pknn12 <- ggraph(graph_knn, layout = 'manual', x = PC_1, y = PC_2) +
    geom_edge_link(aes(alpha = 0.5), show.legend = FALSE) +
    geom_point(aes(x = PC_1, y = PC_2), color = "blue", size = 3) +
    theme_minimal() +
    labs(title = "KNN in PCA Space", x = "PC_1", y = "PC_2")

# Visualize SNN graph
psnn12 <- ggraph(graph_snn, layout = 'manual', x = PC_1, y = PC_2) +
    geom_edge_link(aes(alpha = 0.5), show.legend = FALSE) +
    geom_point(aes(x = PC_1, y = PC_2), color = "orange", size = 3) +
    theme_minimal() +
    labs(title = "SNN in PCA Space", x = "PC_1", y = "PC_2")

pknn12 | psnn12

# Looking also PC1 vs PC3 for a different perspective
pknn13 <- ggraph(graph_knn, layout = 'manual', x = PC_1, y = PC_3) +
    geom_edge_link(aes(alpha = 0.5), show.legend = FALSE) +
    geom_point(aes(x = PC_1, y = PC_3), color = "blue", size = 3) +
    theme_minimal() +
    labs(title = "KNN in PCA Space", x = "PC_1", y = "PC_3")

# Visualize SNN graph
psnn13 <- ggraph(graph_snn, layout = 'manual', x = PC_1, y = PC_3) +
    geom_edge_link(aes(alpha = 0.5), show.legend = FALSE) +
    geom_point(aes(x = PC_1, y = PC_3), color = "orange", size = 3) +
    theme_minimal() +
    labs(title = "SNN in PCA Space", x = "PC_1", y = "PC_3")

pknn13 | psnn13

Looking at the SNN graphs, we can see that they prune connections between extreme points and therefore improve the robustness of our clustering.
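The manual pruning above mirrors what Seurat does internally: FindNeighbors() exposes a prune.SNN parameter (default 1/15) that removes SNN edges whose Jaccard index falls below the cutoff. A sketch with a stricter cutoff than the default:

```r
# Rebuild the SNN graph with a stricter pruning cutoff than the default 1/15;
# edges with a Jaccard index below prune.SNN are dropped from RNA_snn
se_sub <- FindNeighbors(
    se_sub,
    reduction = "pca",
    dims = 1:30,
    k.param = 10,
    prune.SNN = 1 / 7,
    verbose = FALSE
)
```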

RunUMAP

UMAP runs on the PCA space. The dims parameter specifies which dimensions of the PCA space to use for the UMAP calculation; here we use 1:30 to match the dimensions used in FindNeighbors(). The n.components parameter specifies the number of dimensions in the UMAP embedding; the default value is 2.

se <- se %>%
    RunUMAP(dims = 1:30, verbose = FALSE) # Run UMAP 

Visualize Clusters

DimPlot(
    se,
    group.by = c(
        "RNA_snn_res.0.01", "RNA_snn_res.0.05",
        "RNA_snn_res.0.1", "RNA_snn_res.0.15",
        "RNA_snn_res.0.2", "RNA_snn_res.0.25"),
    label = TRUE
)

Community detection doesn’t follow a hierarchical clustering process. At higher resolutions the algorithm splits the graph into more communities; it does not subcluster within the major clusters present at lower resolutions. Clustering at a low resolution splits the dataset into a certain number of groups, while clustering at a high resolution draws different lines on that same data. Below we show how community detection differs from hierarchical clustering.

se@meta.data %>%
    dplyr::count(
        RNA_snn_res.0.01, RNA_snn_res.0.05,
        RNA_snn_res.0.1, RNA_snn_res.0.25) %>% 
    ggplot(
        aes(axis1 = RNA_snn_res.0.01, axis2 = RNA_snn_res.0.05,
            axis3 = RNA_snn_res.0.1, axis4 = RNA_snn_res.0.25,
            y = n)) +
    geom_alluvium(aes(fill = RNA_snn_res.0.01), width = 0.1) +
    geom_stratum(width = 0.1) +
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_x_discrete(
        limits = c("RNA_snn_res.0.01", "RNA_snn_res.0.05",
                   "RNA_snn_res.0.1", "RNA_snn_res.0.25"),
        expand = c(0.15, 0.05)) +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = "Compare Cluster Resolution of 0.01, 0.05, 0.1 and 0.25",
         x = "Clusters",
         y = "Count") +
    theme_minimal()

We see that in this case the clustering follows a fairly hierarchical splitting pattern, although at higher resolutions some cells do branch out into other clusters.

Different Resolutions

0.05 vs 0.1 Resolution

DimPlot(
    se,
    group.by = c("RNA_snn_res.0.05", "RNA_snn_res.0.1"),
    label = TRUE
    )

Comparing the 0.05 and 0.1 resolutions, Cluster 0 in the 0.05 resolution appears to split in two, becoming Clusters 1 and 2 in resolution 0.1. While this split could help sift through the data, we’ve decided to keep the large blobs together; this preserves visibility of our dataset for the second level of annotation. Resolution 0.05 has the best distribution, but let’s quantitatively analyze its cluster quality.

Cluster Metrics

Cluster Diversity

Splitting the clusters by cluster size and sample diversity lets us verify that no single sample is being clustered separately from the others.
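Under the hood, vegan::diversity(index = 'simpson'), used in the helper below, returns the Gini-Simpson index, 1 - sum(p_i^2), where p_i is the proportion of a cluster's cells coming from sample i. A base-R check of the formula:

```r
# Gini-Simpson diversity: 1 - sum(p_i^2), where p_i is the proportion of
# cells from sample i; 0 = one sample dominates, near 1 = evenly mixed
simpson <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

simpson(c(100, 100, 100, 100))  # evenly mixed across 4 samples -> 0.75
simpson(c(997, 1, 1, 1))        # one dominant sample -> close to 0
```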

# seurat_clusters <- "RNA_snn_res.0.05"
diversityPerGroup <- function(df, species, group, diversity_metric = 'simpson') {
  # Convert strings to symbols for curly-curly operator
  species_sym <- rlang::sym(species)
  group_sym <- rlang::sym(group)
  # Count groups per species directly using curly-curly
  tierNcount <- df %>%
    group_by({{species_sym}}) %>%
    count({{group_sym}}, name = "n") %>% ungroup
  # Pivot table to allow vegan::diversity call
  tierNwide <- tierNcount %>%
    pivot_wider(names_from = {{group_sym}}, values_from = n, values_fill = list(n = 0))
  # Use rownames of species for the diversity function, which needs a dataframe
  tierNwide_df <- as.data.frame(tierNwide)
  # species column must be first; use all_of() to select by a string column name
  tierNwide_df <- tierNwide_df %>% select(all_of(species), everything())
  rownames(tierNwide_df) <- tierNwide_df[, 1]
  tierNwide_df <- tierNwide_df[, -1]
  # Calculate diversity
  diversity_values <- vegan::diversity(tierNwide_df, index = diversity_metric)
  # Prepare result as a dataframe
  result <- data.frame(
    species = rownames(tierNwide_df),
    diversity = diversity_values,
    row.names = NULL
  )
  # Rename columns to the species name and "<group>_diversity"
  names(result)[1] <- species
  names(result)[2] <- sprintf('%s_diversity', group)
  return(result)
}

# Calculate simpson's diversity per cluster
clusterMetrics <- diversityPerGroup(se@meta.data,
                        species = 'RNA_snn_res.0.05',
                        group = 'sample_id',
                        diversity_metric = 'simpson')

# Calculate number of cells per cluster and join to metrics table
clusterMetrics <- clusterMetrics %>% left_join(se@meta.data %>% count(RNA_snn_res.0.05))
head(clusterMetrics)
  RNA_snn_res.0.05 sample_id_diversity    n
1                0           0.9076305 7482
2                1           0.8885202 2773
3                2           0.8728958 2493
4                3           0.8566796 2393
5                4           0.8393953 1963
6                5           0.6777248 1835

Let’s visualize Simpson’s diversity index by cluster

ggplot(clusterMetrics, aes(x = forcats::fct_reorder(RNA_snn_res.0.05, -n), y = n)) +
  geom_segment(aes(
          x = RNA_snn_res.0.05, xend = RNA_snn_res.0.05,
          y = 0, yend = n),
      size = 1.5, color = 'grey80') + # Draw lines for lollipops
  geom_point(aes(color = `sample_id_diversity`), size = 5) + # Colored lollipop 'heads', colored by sample_id_diversity
  scale_y_log10() +
  # scale_x_continuous(breaks = seq(0, 20)) + 
  scale_colour_viridis(
      option = "C",
      name = "sample_id_diversity",
      direction = 1,
      limits = c(0, 1)) + # Colorblind-friendly, vibrant color palette
  theme_minimal(base_size = 10) +
  theme(legend.position = "bottom",
        axis.text = element_text(size = 12), 
        axis.title = element_text(size = 14), 
        title = element_text(size = 16)) +
  labs(x = "Clusters",
       y = "log(cluster size)",
       title = "Simpson Diversity Index per Cluster") +
  guides(colour = guide_colourbar(title.position = "top", title.hjust = 0.5))

Clusters 5 & 11 appear to have lower sample diversity. Let’s investigate which samples are in each cluster to see if there is any bias.

Plot sample distribution per cluster

# Prepare the data for plotting
plot_sample <- se@meta.data %>%
  count(RNA_snn_res.0.05, sample_id) %>%
  group_by(RNA_snn_res.0.05) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Create a stacked bar plot
ggplot(
    plot_sample,
    aes(x = factor(RNA_snn_res.0.05),
        y = proportion,
        fill = sample_id)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
      title = "Distribution of Sample IDs Across Clusters",
      x = "Seurat Clusters",
      y = "Proportion",
      fill = "Sample ID") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = donor_pal)

Sample T036 appears to make up the majority of Cluster 5, and T176 the majority of Cluster 11. Looking at it with a table:

table(se$sample_id, se$RNA_snn_res.0.05)
      
          0    1    2    3    4    5    6    7    8    9   10   11
  T024  707   12   61  118    3    0   38   74   14   27    8    3
  T036  433  384  116   55  548  980   24   13   20    5    1    3
  T44   689  211  138  258    3    0   55   78    8   46    7    6
  T057  750  311  166  217   90   91  128   60   31   48   18    1
  T110 1070    0   10    8    0    9    9   24   12    1    0    0
  T160  208  153   75   38  183  141   44   26   12    5    2    4
  T161  169  223   45   30  169  267   15   10   15    2    5    7
  T182  533  124  157   98  139   81   19   31   40   18   13    1
  T017  317  155   45   54  129   90   41   34  128    0    8    7
  T019  212  256  234  207  418   55  114   36   35   17    8   10
  T176  283  577  687  702  181   79  415   41  310    1   53  101
  T189   11   21  207  128    4    4   20  240   38   64    0    4
  T197 1108  154  270  325   33    1  118  183   45   81   40    2
  T203  505   10   72   25    1    1   74   25   23    1    1    2
  T202  487  182  210  130   62   36   68   87   25   17    9    6

Another way to visualize cluster diversity is through a tile plot instead of lollipops. This provides a clear way to compare these numerical metrics side by side.

ggplot(clusterMetrics,
       aes(
           x = 1, 
           y = as.character(RNA_snn_res.0.05),
           fill = sample_id_diversity,
           label = n)) +
  geom_tile(colour = "white") +
  geom_text(nudge_x = 1.5, size = 3) +
  geom_text(aes(label = signif(sample_id_diversity, 2)),size = 3) +
  scale_fill_distiller(palette = "Spectral", limits = c(0, 1)) + theme_classic() +
  coord_fixed(ratio = 0.25) + 
  expand_limits(x = c(0.5, 3)) +
  labs(
      x = "Diversity                                              Size",
      y = "RNA_snn_res.0.05",
      fill = "Simpson Diversity\nfor Sample") +
  theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 12),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(size = 15),
        legend.key.size = unit(1, 'cm'),
        legend.title = element_text(size = 10), 
        legend.text = element_text(size = 10)
        )

Visualizing cluster diversity by batch is also important. This is similar to the QC motivation, where these plots provide a way to explain your data and possible findings before claiming any biological phenomena.

# Prepare the data for plotting
se@meta.data %>%
    count(RNA_snn_res.0.05, batch) %>%
    group_by(RNA_snn_res.0.05) %>%
    mutate(
        proportion = n / sum(n),
        batch = as.character(batch)) %>%
    ungroup() %>%
    # Create a stacked bar plot
    ggplot(
        aes(x = factor(RNA_snn_res.0.05),
            y = proportion,
            fill = batch)) +
    geom_bar(stat = "identity", position = "stack") +
    labs(x = "Seurat Clusters", y = "Proportion", fill = "Batch",
       title = "Distribution of Batches Across Clusters") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
    # Note: donor_pal is keyed by sample_id, not batch, so we rely on the
    # default discrete fill palette here

Silhouette Analysis

As covered in our workshop, silhouette analysis is a way to measure how similar each cell is to the other cells in its own cluster compared to ones in other clusters. The silhouette value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters and vice versa.
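For each cell i, the width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to cells in its own cluster and b(i) the mean distance to cells in the nearest other cluster. A minimal base-R illustration of the formula:

```r
# Silhouette width from the mean intra-cluster distance (a) and the mean
# distance to the nearest other cluster (b)
sil_width <- function(a, b) (b - a) / max(a, b)

sil_width(a = 1, b = 4)  #  0.75: much closer to its own cluster
sil_width(a = 4, b = 1)  # -0.75: closer to a neighboring cluster
sil_width(a = 2, b = 2)  #  0: on the boundary between two clusters
```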

seurat_clusters <- as.character(se@meta.data$RNA_snn_res.0.05)

pca_embeddings <- Embeddings(se, reduction = 'pca')

# Calculate silhouette widths
sil_scores <- silhouette(x = as.numeric(seurat_clusters), dist = dist(pca_embeddings))

# Extract silhouette scores
silhouette_data <- as.data.frame(sil_scores)
head(silhouette_data)
  cluster neighbor     sil_width
1       1        4 -0.0681468715
2       0        7 -0.1685411885
3       1        4 -0.0006979364
4       1        4  0.0394433860
5       0        7  0.0483166334
6       2        4  0.4294661834
# recover cell type names
row.names(silhouette_data) <- row.names(pca_embeddings)

silhouette_arranged <- silhouette_data %>% 
    mutate(cluster = as.character(cluster)) %>%
    group_by(cluster) %>% 
    arrange(-sil_width)

Visualize silhouette scores

overall_average <- silhouette_arranged %>% 
  ungroup %>% 
  summarize(ave = as.numeric(mean(sil_width))) %>% 
  pull(ave)

ggplot(
    silhouette_arranged, 
    aes(x = sil_width, 
        y = cluster, 
        fill = cluster, 
        group = cluster)) +
    geom_bar(stat = "identity", position = 'dodge2') +
    geom_vline(xintercept = overall_average,
               color = 'red',
               linetype = 'longdash') +
    theme_minimal() +
    labs(title = "Silhouette Analysis",
        y = "Cluster",
        x = "Silhouette width",
        fill = "Cluster") +
    theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 20),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(size = 20),
        legend.position = "None")

The dashed red line shows the average silhouette width across all clusters. Clusters 1 and 5 stand out here, and we can look at their distribution on a UMAP for more context.

So let’s look at silhouette scores on a UMAP

d5 <- DimPlot(se,
        reduction = 'umap',
        group.by = 'RNA_snn_res.0.05', 
        label = TRUE)

se$CellID <- rownames(se@meta.data)

sil_ids <- silhouette_data %>%
    rownames_to_column('CellID') %>%
    left_join(se@meta.data, by = 'CellID')
se <- AddMetaData(se, sil_ids)

FeaturePlot(
  se, 
  features = "sil_width") + 
  ggtitle('Silhouette width') + 
  scale_color_distiller(
      type = "div",
      palette = "BrBG") | d5

Parts of clusters 0, 1, 5, and 6 seem to have regions with a notably poor silhouette score, which potentially means these cells are not clustered correctly. This could be due to under-clustering, leading to multiple cell types/states ending up within one cluster. In the next notebook we will check whether there is heterogeneity within cluster 5 and take a closer look at what might be going on.

Save Seurat object

saveRDS(se, file = "../data/clustered_se.rds")

Extra: More on Clustering Algorithms

The Louvain algorithm was developed in 2008 and is a popular community detection algorithm used in scRNA-seq. It recursively merges communities into a single node and executes modularity clustering on the condensed graphs (Zhang). Both Seurat and scanpy use Louvain as the default clustering algorithm.


Leiden Algorithm

The Leiden algorithm was published in 2019 as an improvement on the Louvain algorithm. Leiden creates clusters by taking into account the number of links between cells in a cluster versus the overall expected number of links in the dataset. It computes a clustering on a KNN graph obtained from the PC-reduced expression space. It starts with an initial partition in which each node forms its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on the refined partition, an aggregate network is generated, which is again refined until no further improvements can be obtained and the final partition is reached (Single Cell Best Practices).

There are several popular clustering algorithms. There is no single correct way to cluster, as clustering is a means of looking at the data from different angles. The most popular clustering algorithms are the Louvain algorithm, the Leiden algorithm, hierarchical clustering, and k-means clustering. Seurat uses the Louvain algorithm by default; the Leiden algorithm, an improvement on Louvain, is available via algorithm = 4.

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggraph_2.2.1         tidygraph_1.3.1      ggalluvial_0.12.5    viridis_0.6.5        viridisLite_0.4.2    cluster_2.1.8        DoubletFinder_2.0.4  colorBlindness_0.1.9 RColorBrewer_1.1-3   lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1        purrr_1.0.2          readr_2.1.5          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.1        tidyverse_2.0.0      Seurat_5.2.1         SeuratObject_5.0.2   sp_2.1-4             dplyr_1.1.4         

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1      jsonlite_1.8.9         magrittr_2.0.3         spatstat.utils_3.1-2   farver_2.1.2           rmarkdown_2.29         vctrs_0.6.5            ROCR_1.0-11            memoise_2.0.1          spatstat.explore_3.3-4 htmltools_0.5.8.1      gridGraphics_0.5-1     sctransform_0.4.1      parallelly_1.41.0      KernSmooth_2.23-26     htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             plotly_4.10.4          zoo_1.8-12             cachem_1.1.0           igraph_2.1.2           mime_0.12              lifecycle_1.0.4        pkgconfig_2.0.3        Matrix_1.6-5           R6_2.5.1               fastmap_1.2.0          fitdistrplus_1.2-2     future_1.34.0          shiny_1.10.0           digest_0.6.37          colorspace_2.1-1       patchwork_1.3.0        tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1          vegan_2.6-8            labeling_0.4.3         progressr_0.15.1       timechange_0.3.0       spatstat.sparse_3.1-0  mgcv_1.9-1             httr_1.4.7             polyclip_1.10-7        abind_1.4-8            compiler_4.3.1         withr_3.0.2            fastDummies_1.7.5      ggforce_0.4.2          MASS_7.3-60           
 [52] permute_0.9-7          tools_4.3.1            lmtest_0.9-40          httpuv_1.6.15          future.apply_1.11.3    goftest_1.2-3          glue_1.8.0             nlme_3.1-166           promises_1.3.2         grid_4.3.1             Rtsne_0.17             reshape2_1.4.4         generics_0.1.3         gtable_0.3.6           spatstat.data_3.1-4    tzdb_0.4.0             data.table_1.16.4      hms_1.1.3              spatstat.geom_3.3-4    RcppAnnoy_0.0.22       ggrepel_0.9.6          RANN_2.6.2             pillar_1.10.1          spam_2.11-0            RcppHNSW_0.6.0         later_1.4.1            splines_4.3.1          tweenr_2.0.3           lattice_0.22-6         survival_3.8-3         deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.49             gridExtra_2.3          scattermore_1.2        xfun_0.50              graphlayouts_1.2.1     matrixStats_1.5.0      stringi_1.8.4          lazyeval_0.2.2         yaml_2.3.10            evaluate_1.0.3         codetools_0.2-20       cli_3.6.3              uwot_0.1.16            xtable_1.8-4           reticulate_1.40.0      munsell_0.5.1          Rcpp_1.0.14           
[103] globals_0.16.3         spatstat.random_3.3-2  png_0.1-8              spatstat.univar_3.1-1  parallel_4.3.1         dotCall64_1.2          listenv_0.9.1          scales_1.3.0           ggridges_0.5.6         rlang_1.1.4            cowplot_1.1.3